ID5059 P1: Used Cars Entry Price Prediction

STUDENT ID: 210015564

Frame the Problem

We are tasked with using the used cars dataset from Kaggle, collected by Austin Reese, which contains over 426,000 used-car listings scraped from Craigslist, to build a model of entry prices. The data includes useful variables such as the car's manufacturer and model, its drive, type, transmission, colour, location, and so on.

The goal of the project is to predict the entry price. We explore visualisations and observations, and eventually build a regression model that learns from this data to make predictions using supervised learning. Such predictions help individuals and businesses estimate how much a car will be worth after years of use and depreciation, information that can inform decisions when purchasing new cars, among other use cases.

Import Packages and Libraries

In [4]:
import opendatasets as od
import math 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_selection import f_regression, SelectKBest
from scipy.stats import uniform, truncnorm, randint

from sklearn import metrics
from sklearn.metrics import mean_absolute_error as mae
from joblib import dump, load

from sklearn.ensemble import RandomForestRegressor

Download Data Set

In [5]:
od.download("https://www.kaggle.com/austinreese/craigslist-carstrucks-data")
Skipping, found downloaded files in "./craigslist-carstrucks-data" (use force=True to force download)

Explore Data Set

Let's understand what the exact columns are and what their corresponding values look like.

In [6]:
df = pd.read_csv("./craigslist-carstrucks-data/vehicles.csv")
df.head()
Out[6]:
id url region region_url price year manufacturer model condition cylinders ... size type paint_color image_url description county state lat long posting_date
0 7222695916 https://prescott.craigslist.org/cto/d/prescott... prescott https://prescott.craigslist.org 6000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN az NaN NaN NaN
1 7218891961 https://fayar.craigslist.org/ctd/d/bentonville... fayetteville https://fayar.craigslist.org 11900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ar NaN NaN NaN
2 7221797935 https://keys.craigslist.org/cto/d/summerland-k... florida keys https://keys.craigslist.org 21000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN fl NaN NaN NaN
3 7222270760 https://worcester.craigslist.org/cto/d/west-br... worcester / central MA https://worcester.craigslist.org 1500 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ma NaN NaN NaN
4 7210384030 https://greensboro.craigslist.org/cto/d/trinit... greensboro https://greensboro.craigslist.org 4900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN nc NaN NaN NaN

5 rows × 26 columns

In [7]:
df.info()
df.nunique(axis=0)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  paint_color   296677 non-null  object 
 19  image_url     426812 non-null  object 
 20  description   426810 non-null  object 
 21  county        0 non-null       float64
 22  state         426880 non-null  object 
 23  lat           420331 non-null  float64
 24  long          420331 non-null  float64
 25  posting_date  426812 non-null  object 
dtypes: float64(5), int64(2), object(19)
memory usage: 84.7+ MB
Out[7]:
id              426880
url             426880
region             404
region_url         413
price            15655
year               114
manufacturer        42
model            29667
condition            6
cylinders            8
fuel                 5
odometer        104870
title_status         6
transmission         3
VIN             118264
drive                3
size                 4
type                13
paint_color         12
image_url       241899
description     360911
county               0
state               51
lat              53181
long             53772
posting_date    381536
dtype: int64

From these few functions, we observe a total of 426,880 entries. Only id, url, region, region_url, price and state contain no missing values; every other column has missing entries, with county entirely empty.

We have several numerical variables and categorical variables ("object"); the latter include condition, cylinders, fuel type, transmission type, drive type, size, paint_color, car type, and state (51 values, covering the USA).

In [8]:
df.describe()
Out[8]:
id price year odometer county lat long
count 4.268800e+05 4.268800e+05 425675.000000 4.224800e+05 0.0 420331.000000 420331.000000
mean 7.311487e+09 7.519903e+04 2011.235191 9.804333e+04 NaN 38.493940 -94.748599
std 4.473170e+06 1.218228e+07 9.452120 2.138815e+05 NaN 5.841533 18.365462
min 7.207408e+09 0.000000e+00 1900.000000 0.000000e+00 NaN -84.122245 -159.827728
25% 7.308143e+09 5.900000e+03 2008.000000 3.770400e+04 NaN 34.601900 -111.939847
50% 7.312621e+09 1.395000e+04 2013.000000 8.554800e+04 NaN 39.150100 -88.432600
75% 7.315254e+09 2.648575e+04 2017.000000 1.335425e+05 NaN 42.398900 -80.832039
max 7.317101e+09 3.736929e+09 2022.000000 1.000000e+07 NaN 82.390818 173.885502

From this basic statistical description, we can observe the following about the numerical variables:

  1. The price ranges from 0 to roughly 3.7 billion, meaning some cars were listed for free and others at absurdly inflated prices. These outliers will need to be handled.
  2. The oldest cars date all the way back to 1900.
  3. The odometer ranges from 0 to 10 million miles, which also indicates outliers.
  4. The county column is entirely empty (0 non-null values).

Visualise and Clean Data Set

We can start cleaning and transforming the data to separate the signal from the noise. Firstly, we remove variables we logically don't need: the success of our model relies on selecting a good set of relevant features ("feature engineering"), which involves choosing the most useful features and/or combining features into more useful ones. We also plot some visualisations to understand linearity and the relationships between the dependent variable and the independent variables.

In [9]:
#> Remove variables we don't need based on common sense ('id', 'url', 'region', 'region_url', 'title_status', 'VIN','image_url', 'description', 'county', 'posting_date'). 

df = df.drop(columns = ['id', 'url', 'region', 'region_url', 'title_status', 'VIN','image_url', 'description', 'county', 'posting_date'])

Secondly, we clean up the outliers in price, year and odometer, because these make it difficult for the model to detect the underlying patterns. Rather than discarding data wholesale, we trim only the extreme values so the loss stays small.

In [10]:
sns.histplot(df['price'], stat='density', kde=True)
Out[10]:
<AxesSubplot:xlabel='price', ylabel='Density'>
In [11]:
fig, ax = plt.subplots(figsize=(12,4))
ax.set_title('Box Whisker Plot to Identify Outliers in Prices')
sns.boxplot(x= df['price'])
Out[11]:
<AxesSubplot:title={'center':'Box Whisker Plot to Identify Outliers in Prices'}, xlabel='price'>

We can identify outliers from the skewed values on the left and right of the graph. Some of these extreme values make it even harder to see the remainder of the values. We remove these outliers by using the interquartile range. These extreme values are possible because the data is scraped from real-world entries where typos in the entries are likely.

In [12]:
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3-Q1

filtered_df = (df['price'] >= Q1 - 1.5 * IQR)  & (df['price'] <= Q3 + 1.5 * IQR)

old_size = df.count()['price']
df = df.loc[filtered_df]
new_size = df.count()['price']
print(old_size-new_size, '(', '{:.2f}'.format(100*(old_size-new_size)/old_size), '%',')', 'outliers removed from dataset')
8177 ( 1.92 % ) outliers removed from dataset

With almost a 2% data loss, we get a clearer distribution of the prices. We can see there is still a large number of free cars (price = 0), but since this is still a possibility in the real world, we keep it instead of removing it.
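To quantify the zero-price listings mentioned above, a small helper (illustrative, not part of the original notebook) could report their count and share:

```python
import pandas as pd

def zero_price_share(prices: pd.Series) -> tuple[int, float]:
    """Return the number of zero-price listings and their share of all rows."""
    n_free = int((prices == 0).sum())
    return n_free, n_free / len(prices)

# e.g. zero_price_share(df['price']) on the cleaned frame
```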

In [13]:
sns.histplot(df['price'], stat='density', kde=True)
Out[13]:
<AxesSubplot:xlabel='price', ylabel='Density'>

The odometer also has outliers. As with price, these could be typing errors or simply mistakes in the listings. Both low and high mileages are possible in principle, but the extremes here are implausible.

In [14]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(x=df['odometer'], y=df['price'])
ax.set_title('Scatter Plot of Price vs Odometer')
In [15]:
fig, ax = plt.subplots(figsize=(12,4))
ax.set_title('Box Whisker Plot to Identify Outliers in Odometer')
sns.boxplot(x= df['odometer'])
Out[15]:
<AxesSubplot:title={'center':'Box Whisker Plot to Identify Outliers in Odometer'}, xlabel='odometer'>

Since the outliers begin beyond roughly 0.375 on the odometer axis, we can drop the values that exceed that point. We again use the IQR method, this time with a wider fence of 3×IQR applied only to the upper tail.

In [16]:
Q1 = df['odometer'].quantile(0.25)
Q3 = df['odometer'].quantile(0.75)
IQR = Q3-Q1

filtered_df = (df['odometer'] <= Q3 + 3 * IQR)

old_size = df.count()['odometer']
df = df.loc[filtered_df]
new_size = df.count()['odometer']
print(old_size-new_size, '(', '{:.2f}'.format(100*(old_size-new_size)/old_size), '%',')', 'outliers removed from dataset')
1531 ( 0.37 % ) outliers removed from dataset
In [17]:
plt.figure(figsize=(20,10))
ax = sns.scatterplot(x=df['odometer'], y=df['price'])
ax.set_title('Scatter Plot of Price vs Odometer (After Filtering)')
In [18]:
sns.histplot(df['odometer'], stat='density', kde=True)
Out[18]:
<AxesSubplot:xlabel='odometer', ylabel='Density'>

We can observe an inverse relationship: cars with higher mileage cost less, and vice versa, which matches real-world intuition. While many cars were listed for free or show 0 mileage, we keep these values because such listings do occur in the real world (who wouldn't want to live in a world like this). Keeping them also lets the model cover a wider variety of cases and helps prevent overfitting. We narrow the dataset simply by removing the extreme values that skew it.

We also restrict the year to roughly the last 50 years (1970-2021) to minimise instability in the prices. Cars bought more than 50 years ago may no longer be representative of today's market; some may even be considered collectibles. After narrowing this range, we can observe a positive correlation between year and price.

In [19]:
df = df[df['year'].between(1970,2021)]
In [20]:
sns.histplot(df['year'], stat='density', kde=True)
Out[20]:
<AxesSubplot:xlabel='year', ylabel='Density'>
In [21]:
fig, ax= plt.subplots(figsize=(45,20), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['year'], y=df['price'], ci = None, palette='magma')
Out[21]:
<AxesSubplot:xlabel='year', ylabel='price'>

Finally, we can plot a correlation matrix to gauge the strength of the relationships between the numerical variables and the entry price.

In [22]:
corr = df.corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr, annot=True, vmin=-1,vmax=1)
plt.show()
In [23]:
fig, ax = plt.subplots(figsize=(35,15), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['year'], y=df['odometer'], ci=None, palette='rocket')
Out[23]:
<AxesSubplot:xlabel='year', ylabel='odometer'>

There is a clear correlation between year and odometer, likely because older cars have accumulated more mileage over the years while newer models have not. We could investigate whether this collinearity affects our model by calculating Variance Inflation Factors (VIF), and drop one of the two variables if it proved severe; for the purposes of this project, we assume the VIF values are within acceptable bounds.
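As a sketch of the VIF check we are skipping (not part of the original notebook; column names below are illustrative), one can use the identity that, for standardised predictors, the VIFs are the diagonal of the inverse correlation matrix:

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame, cols: list) -> pd.Series:
    # For standardised predictors, VIF_i = [inv(R)]_ii,
    # where R is the correlation matrix of the predictors.
    corr = frame[cols].corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=cols)

# A common rule of thumb flags VIF > 5 (or > 10) as problematic collinearity.
```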

Next, we can summarise and visualise the categorical variables to understand their relationship with the target variable.

In [24]:
fig, axes = plt.subplots(1,2,figsize=(25,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['condition'], y=df['price'],ci=None, color='lightcoral', ax=axes[0])
sns.barplot(x= df['transmission'], y=df['price'], ci=None, color='lightgreen',ax=axes[1])
Out[24]:
<AxesSubplot:xlabel='transmission', ylabel='price'>

Cars in better condition sell for more than cars in poorer condition. In terms of transmission, automatic cars compare favourably with manuals, albeit not significantly. Both, however, pale in comparison to 'other', which could include continuously variable (CVT), semi-automatic, or dual-clutch transmissions.

In [25]:
fig, axes = plt.subplots(1,3,figsize=(35,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['paint_color'], y=df['price'], ci=None, palette='rocket', ax=axes[0])
sns.barplot(x= df['drive'], y=df['price'], ci=None, palette='mako', ax=axes[1])
sns.barplot(x= df['fuel'], y=df['price'], ci=None, palette='viridis', ax=axes[2])
Out[25]:
<AxesSubplot:xlabel='fuel', ylabel='price'>

There is an assumption that some colours cost more than others, and while white, black, red and orange cars do fetch more, there is less variance in prices across colours than expected.

Four-wheel drives also sell slightly higher than rear-wheel drives, while front-wheel drives are more affordable. Comparing drive type against year, we can see that rear-wheel drive was the default option until four-wheel drive gained popularity from the 90s onward.

Diesel-fuelled cars cost more than gas and hybrid options, likely because diesel engines are typically more expensive and are used in larger vehicles. Electric cars cost almost as much as diesel.

In [26]:
fig, axes = plt.subplots(1,3,figsize=(35,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['size'], y=df['price'], ci=None, palette='rocket', ax=axes[0])
sns.barplot(x= df['type'], y=df['price'], ci=None, palette='mako', ax=axes[1])
sns.barplot(x= df['cylinders'], y=df['price'], ci=None, palette='viridis', ax=axes[2])
Out[26]:
<AxesSubplot:xlabel='cylinders', ylabel='price'>

Larger cars tend to be more expensive, and this can be explained by the type of car. Pickups, trucks, SUVs and coupes are likely to be pricier because of their manufacturers and size. Surprisingly, large vehicles like buses, mini-vans and wagons do not necessarily cost more, likely because they are sold well used; we could compare this with the year to identify any correlation between age and price.

Cars with 6, 8 or 10 cylinders tend to be more expensive, while 4 and 5-cylinder cars are cheaper.
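The comparison suggested above (price against age for each vehicle type) can be sketched as a simple grouped mean. The helper below is illustrative and assumes a frame with a 'price' column plus the grouping columns:

```python
import pandas as pd

def mean_price_by(frame: pd.DataFrame, *keys: str) -> pd.Series:
    # Average price per group, e.g. mean_price_by(df, 'type', 'year')
    return frame.groupby(list(keys))['price'].mean()
```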

In [27]:
fig, ax = plt.subplots(figsize=(35,15), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['manufacturer'], y=df['price'], ci=None, palette='rocket')
Out[27]:
<AxesSubplot:xlabel='manufacturer', ylabel='price'>

We can observe a correlation between price and high-end manufacturers such as tesla, jaguar, porsche, rover, aston-martin, audi, etc. These manufacturers also show outliers, potentially because they produce models that cost significantly more than others. Nonetheless, a large proportion of the dataset is dominated by low-to-medium budget manufacturers.

In [28]:
fig, ax = plt.subplots(figsize=(25,15), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['state'], y=df['price'], ci=None, palette='rocket')
Out[28]:
<AxesSubplot:xlabel='state', ylabel='price'>
In [29]:
df.plot(kind="scatter", x="long", y="lat", alpha=0.4, s=df["price"]/1000000,
        label="price", figsize=(10,7), c="price", cmap=plt.get_cmap("jet"),
        colorbar=True)
plt.xlim(-150,0)
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x32fd73b20>
In [30]:
df = df.drop(columns=['lat','long'])

Deep Dive into Data

We can further deep dive into this data to answer some business questions and gain insights to what factors significantly influence the target variable.

In [33]:
fig, ax = plt.subplots(figsize=(25,15), sharey=True)
fig.suptitle('How is condition influenced by mileage?')
sns.barplot(x= df['condition'], y=df['odometer'], ci=None, palette='rocket')
Out[33]:
<AxesSubplot:xlabel='condition', ylabel='odometer'>
In [34]:
%matplotlib inline
fig, ax = plt.subplots(figsize=(35,10), sharey=True)
fig.suptitle('What types of cars does each manufacturer sell most?')
sns.histplot(binwidth=0.2, x=df['manufacturer'], hue=df['type'], data=df, stat="count", multiple="stack")
Out[34]:
<AxesSubplot:xlabel='manufacturer', ylabel='Count'>
In [47]:
fig, ax = plt.subplots(figsize=(35,20), sharey=True)
Company_Kilometers_Driven = df.groupby('manufacturer').odometer.mean()
Company_Kilometers_Driven.plot(kind='bar')
plt.xlabel("manufacturer")
plt.ylabel("average odometer reading")
plt.title("What is the average mileage of the car before it is sold?")
plt.show()
In [36]:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Is there a preference for what kind of drive is chosen each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'drive')
Out[36]:
<AxesSubplot:title={'center':'Is there a preference for what kind of drive is chosen each year?'}, xlabel='year', ylabel='price'>
In [37]:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('What type of car sells most each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'type')
Out[37]:
<AxesSubplot:title={'center':'What type of car sells most each year?'}, xlabel='year', ylabel='price'>
In [38]:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Has there been a change in fuel prices over the past few years?')
sns.scatterplot(x='year', y='price', data=df, hue = 'fuel')
Out[38]:
<AxesSubplot:title={'center':'Has there been a change in fuel prices over the past few years?'}, xlabel='year', ylabel='price'>
In [39]:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Is there a preference for what kind of drive is chosen each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'transmission')
Out[39]:
<AxesSubplot:title={'center':'Is there a preference for what kind of drive is chosen each year?'}, xlabel='year', ylabel='price'>

Data Pre-Processing

Our dataset contains plenty of null values that are hard to fill with accurate guesses. Since none of the remaining numerical variables have missing values (the earlier range filters dropped those rows), we can focus on the categorical variables. We take the following two actions:

For columns with more than 40% missing values, we remove the whole column, because too much of the information is gone to impute reliably.

For columns with less than 40% missing, we recode the missing values as 'other' instead of discarding more data.

In [40]:
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x375401330>
In [49]:
null_values = df.isna().sum()

def na_filter(na, threshold = 0.4):
    column = []
    for i in na.keys():
        if na[i]/df.shape[0] < threshold:
            column.append(i)
    return column

df = df[na_filter(null_values)]
df.columns
Out[49]:
Index(['price', 'year', 'manufacturer', 'model', 'fuel', 'odometer',
       'transmission', 'drive', 'type', 'paint_color', 'state'],
      dtype='object')

These are the columns remaining after dropping any column with >40% missing values. We now recode the remaining missing values as 'other'.

In [50]:
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x3aceb4580>
In [51]:
df = df.fillna('other')

Check for any final missing values.

In [52]:
sns.displot(
    data=df.isnull().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x3acc27340>

Let's take a look at what our data looks like and how many unique values each variable contains.

In [53]:
df.head()
Out[53]:
price year manufacturer model fuel odometer transmission drive type paint_color state
27 33590 2014.0 gmc sierra 1500 crew cab slt gas 57923.0 other other pickup white al
28 22590 2010.0 chevrolet silverado 1500 gas 71229.0 other other pickup blue al
29 39590 2020.0 chevrolet silverado 1500 crew gas 19160.0 other other pickup red al
30 30990 2017.0 toyota tundra double cab sr gas 41124.0 other other pickup red al
31 15000 2013.0 ford f-150 xlt gas 128000.0 automatic rwd truck black al
In [54]:
df.nunique(axis=0)
Out[54]:
price            14309
year                52
manufacturer        43
model            26959
fuel                 5
odometer        102580
transmission         3
drive                4
type                13
paint_color         13
state               51
dtype: int64

Among the categorical variables, model has 26,959 unique values. We drop this column because its cardinality would explode the dimensionality of our dataset once we one-hot encode the categoricals, adding on the order of 26,959 columns. Additionally, since we keep manufacturer, and each model belongs to exactly one manufacturer, retaining both would reintroduce a degree of multicollinearity, as there is surely a relationship between a car's make and its model.
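As an aside, one common alternative to dropping a high-cardinality column like model is frequency encoding, which maps each category to its relative frequency and so adds a single numeric column instead of thousands of dummies. A minimal sketch (not what this notebook does):

```python
import pandas as pd

def frequency_encode(col: pd.Series) -> pd.Series:
    # Replace each category with its share of the column's rows.
    return col.map(col.value_counts(normalize=True))
```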

In [55]:
df = df.drop(columns=['model'])

Since we can only select four predictor variables, we use year, odometer, manufacturer, and type, and drop the remaining columns. This combination was selected by trial and error to see which set of variables yields the best accuracy; since accuracy is not the primary goal of this practical, we settle on this combination, which yields a satisfactory result, as we will see later.

In [56]:
df = df.drop(columns=['transmission','fuel','state','paint_color','drive'])

'condition' was the only ordinal variable, but since we removed it for having >40% missing values, the remaining variables are all nominal (order is irrelevant). For this reason, we one-hot encode the categorical variables with dummy columns rather than any other form of encoding; this prevents the model from inferring a spurious order between the values of a variable.

In [57]:
catColumns = ['manufacturer','type']
for column in catColumns:
    dummies = pd.get_dummies(df[column], drop_first=True)
    df = pd.concat([df, dummies], axis=1)
df = df.drop(columns=catColumns)

Create a Train-Test Split

While we could choose stratified sampling, we opt for a randomised split using scikit-learn's train_test_split() because we do not know enough about the demographics of our dataset; the only distinct characteristic we have to distinguish our population is location, and stratified sampling is in any case better suited to classification problems. The test set represents 20% of the original data, and we later verify that the train and test sets have an identical set of columns.

In [58]:
X_train, X_test, y_train, y_test= train_test_split(df.drop('price',axis=1), 
                                                    df['price'],test_size=0.20, 
                                            random_state=5564)
In [59]:
df = X_train.copy()
df_test = X_test.copy()
In [60]:
df_train_labels = y_train.copy()
df_test_labels = y_test.copy()

We standardize the numerical variables within the train and test sets so that they are on a comparable scale with our categorical variables. We perform this only after the split: standardizing before the train-test split would introduce a data leak, because the global mean and standard deviation would be computed partly from rows that later end up in the test set.
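For reference, the textbook leak-free pattern (sketched here with plain NumPy on toy data, not the notebook's actual columns) computes the statistics on the training rows only and reuses them on the test rows:

```python
import numpy as np

rng = np.random.default_rng(5564)
train = rng.normal(loc=2013, scale=9, size=1000)   # toy stand-in for 'year'
test = rng.normal(loc=2013, scale=9, size=200)

mu, sigma = train.mean(), train.std()              # statistics from train only
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma                     # no test information leaks in
```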

In [61]:
scaler = StandardScaler()

for column in ['year','odometer']:
    df[column] = scaler.fit_transform(df[column].values.reshape(-1,1))

We standardize the test set separately with its own scaler, so no statistics carry over between the two sets.

In [62]:
std_Scaler = StandardScaler()

for column in ['year','odometer']:
    df_test[column] = std_Scaler.fit_transform(df_test[column].values.reshape(-1,1))

Lastly, we check that both our train and test sets have an identical set of columns after the encoding and that the values are comparable across variables.

In [63]:
df.head()
Out[63]:
year odometer alfa-romeo aston-martin audi bmw buick cadillac chevrolet chrysler ... coupe hatchback mini-van offroad other pickup sedan truck van wagon
226291 -0.828528 -0.125693 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
408968 -0.828528 1.447829 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
92590 -0.545020 0.273422 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17370 -0.261513 2.212647 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
322921 -0.686774 1.030101 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 56 columns

In [64]:
df_test.head()
Out[64]:
year odometer alfa-romeo aston-martin audi bmw buick cadillac chevrolet chrysler ... coupe hatchback mini-van offroad other pickup sedan truck van wagon
228759 -0.969290 0.414029 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
386312 0.449902 0.142840 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7466 0.875660 -0.848805 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
415324 0.449902 0.047126 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
136507 -4.659189 -1.324774 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0

5 rows × 56 columns

Building our Model

Linear Regression and Random Forest Regressor (9 Variables)

[Screenshots, 2022-03-09: code cells and metric outputs for the 9-variable Linear Regression and Random Forest models]

Our linear regression model performs equally well (46% accuracy) on both train and test with nine variables, indicating that overfitting is not an issue. We use this as a performance baseline for our Random Forest with its default n_estimators = 100, which yields a much higher accuracy on training (96.7%) than on test (78%), a sign of overfitting. Our MAE and RMSE scores are relatively high but improve with the Random Forest model. We can reduce the errors further through feature engineering, applying other algorithms, and hyperparameter tuning.

Next, we select our top four features using feature_importances_. While not intuitive at first, we can observe that year, odometer, type and manufacturer perform best. Type and manufacturer score well cumulatively, as whole attributes rather than as single one-hot columns. Instead of type, we could also try predicting with year, odometer, manufacturer and drive, but since accuracy values are not the priority for this practical, we continue with the four variables initially decided.
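Since those cells survive only as screenshots, here is a minimal, self-contained sketch of the feature-importance step on toy data (the column names and data are illustrative, not the notebook's actual design matrix):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5564)
X = pd.DataFrame({
    'year': rng.uniform(1970, 2021, size=500),
    'odometer': rng.uniform(0, 3e5, size=500),
    'noise': rng.normal(size=500),          # an uninformative column
})
# Toy target: newer, lower-mileage cars are worth more, plus some noise.
y = 500 * (X['year'] - 1970) - 0.05 * X['odometer'] + rng.normal(0, 100, size=500)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False)
```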

[Screenshots, 2022-03-09: feature_importances_ output and top-feature plot]

Linear Regression and Random Forest Regressor (4 Variables)

In [69]:
from sklearn.linear_model import LinearRegression
lrmodel = LinearRegression()
lrmodel.fit(df,df_train_labels)
y_pred = lrmodel.predict(df_test)
In [66]:
Acc = pd.DataFrame(index=None, columns=['Model','Mean Absolute Error','Root Mean Squared  Error','Accuracy on Traing set','Accuracy on Testing set'])
In [70]:
name = 'Linear Regression'
MAE = round(metrics.mean_absolute_error(df_test_labels,y_pred),2)
RMSE = np.sqrt(metrics.mean_squared_error(df_test_labels, y_pred))
ATrS =  lrmodel.score(df,df_train_labels)
ATeS = lrmodel.score(df_test,df_test_labels)
Acc = Acc.append(pd.Series({'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared  Error': RMSE,'Accuracy on Traing set':ATrS,'Accuracy on Testing set':ATeS}),ignore_index=True )
/var/folders/sr/mv84b_gn0x599hl3rl_bfpyr0000gn/T/ipykernel_18273/3362433715.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  Acc = Acc.append(pd.Series({'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared  Error': RMSE,'Accuracy on Traing set':ATrS,'Accuracy on Testing set':ATeS}),ignore_index=True )
In [71]:
Acc
Out[71]:
Model Mean Absolute Error Root Mean Squared Error Accuracy on Traing set Accuracy on Testing set
0 Linear Regression 7369.94 10248.662828 0.384282 0.385226

Our linear regression model performs worse with fewer variables, which is understandable. Our MAE and RMSE nevertheless remain fairly consistent, a good sign that the reduced number of features is not itself the problem.

For our random forest with four variables, we implement hyperparameter tuning with cross-validation to see whether it performs better, at least against the linear regression.

We store our candidate hyperparameter distributions in a dictionary to pass to RandomizedSearchCV. We narrow down only two hyperparameters (n_estimators and min_samples_split) because the maximum number of features is already defined.

For n_estimators we draw a random integer between 4 and 200; for min_samples_split we draw a fraction from a uniform distribution starting at 0.01 with width 0.199.

In [72]:
model_params = {
    'n_estimators': randint(4,200),
    'min_samples_split': uniform(0.01, 0.199)
}

We instantiate the RandomForestRegressor and set up our randomised search by passing the model, our parameter distributions, the number of candidate settings, and the number of cross-validation folds. Each iteration trains a new model on a fresh draw from our dictionary of parameters, and the folds determine how many different subsets of the data each candidate is evaluated on. The total number of models the random search trains is therefore the number of iterations multiplied by the number of folds, and the output is the best set of hyperparameters found across all candidates. This is a time-consuming process, so brace yourself.
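To make that budget concrete: with n_iter=20 candidate settings and cv=5 folds, the search fits 20 × 5 = 100 models, plus one final refit of the best setting on the full training set (scikit-learn's default refit=True):

```python
n_iter, cv = 20, 5
total_search_fits = n_iter * cv        # models trained during the search
total_fits = total_search_fits + 1     # plus the final refit of the best model
```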

In [73]:
rf2 = RandomForestRegressor()
clf = RandomizedSearchCV(rf2, model_params, n_iter=20, cv=5, random_state=5564)
model = clf.fit(df,df_train_labels)
from pprint import pprint
pprint(model.best_estimator_.get_params())
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 0.015450131046387306,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 54,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Use our tuned model to predict our test set.

In [74]:
y_pred = model.predict(df_test)
In [75]:
Acc = pd.DataFrame(index=None, columns=['Model','Mean Absolute Error','Root Mean Squared  Error','Accuracy on Traing set','Accuracy on Testing set'])
In [78]:
name = 'Random Forest Regressor'
MAE = round(metrics.mean_absolute_error(df_test_labels,y_pred),2)
RMSE = np.sqrt(metrics.mean_squared_error(df_test_labels, y_pred))
ATrS =  model.score(df,df_train_labels)
ATeS = model.score(df_test,df_test_labels)
Acc = Acc.append(pd.Series({'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared  Error': RMSE,'Accuracy on Traing set':ATrS,'Accuracy on Testing set':ATeS}),ignore_index=True )
/var/folders/sr/mv84b_gn0x599hl3rl_bfpyr0000gn/T/ipykernel_18273/974868182.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  Acc = Acc.append(pd.Series({'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared  Error': RMSE,'Accuracy on Traing set':ATrS,'Accuracy on Testing set':ATeS}),ignore_index=True )
In [79]:
Acc
Out[79]:
Model Mean Absolute Error Root Mean Squared Error Accuracy on Traing set Accuracy on Testing set
0 Random Forest Regressor 6823.82 9771.298524 0.448848 0.441162

Our Random Forest beats our Linear Regression in accuracy on both sets, and its training and test accuracies are now close, so overfitting is no longer the concern it was with nine variables. On a positive note, our MAE and RMSE do not deviate much from the linear regression model, which suggests the model itself is not the problem; the drop relative to the nine-variable Random Forest points instead to the removed variables. We can likely reduce the error through feature engineering, by transforming or scaling our features, or by handling the outliers differently.

In [80]:
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(df_test_labels, y_pred)
Out[80]:
<matplotlib.collections.PathCollection at 0x144ebf460>